Mitchell Morrison, Kyle Kolodziej, Brian Pattison
For this lab on exploring image data we picked the Multi-class Weather Dataset from Kaggle (https://www.kaggle.com/pratik2901/multiclass-weather-dataset). It was originally collected in 2018 at the University of South Africa's science and technology campus. The dataset contains 4 different weather classes, each containing nearly 250 images: sunrise, shine, rain, and cloudy.
This data is important because it can assist in classifying real-time weather images or feeds from APIs. Many smart homes today factor in a real-time weather feed to control settings such as in-home lighting and watering (primarily sprinklers) based on external conditions. A classification model for weather images could be used to populate data in a weather application or serve as an API for the current conditions in an area. With this information, smart homes would call the API to receive real-time information and then act accordingly.
We expect smart homes would only employ this technology if it meets an accuracy of at least 80%. Digital assistants like Alexa and Siri have been shown to accurately answer 80 and 83 percent of queries, respectively, according to a 2019 Business Insider study (https://www.businessinsider.com/amazon-bolsters-alexa-skill-voice-accuracy-2019-10). Although this is not a definitive metric for the necessary success rate of smart home devices, we plan on using this number to indicate whether our classification is headed in the right direction.
Based on market data, we believe that individuals would pay for a service to help run their smart home. We found that smart irrigation technology (letting nature do the work when it rains rather than paying to water the yard yourself) can save households up to 40% on their water usage (https://www.gardeningknowhow.com/garden-how-to/watering/what-is-smart-irrigation.htm).
One caveat to our overall prediction accuracy is that each weather type has its own accuracy. Even if we meet a certain level of accuracy overall, that does not mean each weather class is classified to that level of accuracy. If that is the case, we will explore it further in our analysis.
import pandas as pd
import numpy as np
import os
import glob
from matplotlib import pyplot as plt
import matplotlib.image as mpimg
from PIL import Image
def getAndResizeImages(file_path):
    images = []
    for file in os.listdir(file_path):
        path = file_path + '/' + file
        im = Image.open(path)
        im = im.resize((100, 100))
        images.append(im)
    return images
folderPath = "../../weather"
types = ['Sunrise', 'Shine', 'Rain', 'Cloudy']
folderPaths = [folderPath + '/Sunrise', folderPath + '/Shine', folderPath + '/Rain', folderPath + '/Cloudy']
resizedImages = []
for path in folderPaths:
    resizedImages.append(getAndResizeImages(path))
allImages = []
for img in resizedImages:
    for i in img:
        allImages.append(i)
print(len(allImages), "total images")
for images, title in zip(resizedImages, types):
    print(title, ":", len(images), "images in this set")
for images, title in zip(resizedImages, types):
    plt.figure(figsize=(20, 20))
    for i in range(4):
        img = images[i]
        ax = plt.subplot(1, 4, i + 1)
        plt.title(title)
        plt.imshow(img)
from numpy import asarray
from sklearn.decomposition import PCA
# One-line alternative: convert to numpy arrays and flatten in a single comprehension
# np_image = [[asarray(image).flatten() for image in lists] for lists in resizedImages]
X = []
y = []
np_image = []
for idx, setting in enumerate(resizedImages):
    pics = []
    for image in setting:
        i = asarray(image).flatten()
        if len(i) == 30000:  # keep only 100x100 RGB images (100 * 100 * 3 values)
            pics.append(i)
            X.append(i)
            y.append(idx)
    np_image.append(pics)
def plot_gallery(images, h, w, n_row=1, n_col=4):
    """Helper function to plot a gallery of images"""
    plt.figure(figsize=(1.7 * n_col, 2.3 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        plt.xticks(())
        plt.yticks(())
def plot_explained_variance(pca):
    import plotly
    from plotly.graph_objs import Bar, Line
    from plotly.graph_objs import Scatter, Layout
    from plotly.graph_objs.scatter import Marker
    from plotly.graph_objs.layout import XAxis, YAxis
    plotly.offline.init_notebook_mode()  # run at the start of every notebook
    explained_var = pca.explained_variance_ratio_
    cum_var_exp = np.cumsum(explained_var)
    plotly.offline.iplot({
        "data": [Bar(y=explained_var, name='individual explained variance'),
                 Scatter(y=cum_var_exp, name='cumulative explained variance')],
        "layout": Layout(xaxis=XAxis(title='Principal components'),
                         yaxis=YAxis(title='Explained variance ratio'))
    })
# let's use PCA to reduce each image from 30,000 pixel values to 200 components
from sklearn.decomposition import PCA
n_components = 200
h = 200
w = 150
print("Extracting the top %d eigenweathers from %d weather images" % (
    n_components, len(X)))
pca = PCA(n_components=n_components)
%time pca.fit(X.copy())
X_pca_features = pca.components_.reshape((n_components, h, w))
plot_explained_variance(pca)
The PCA analysis above details the cumulative explained variance ratio across our 1125 weather images. We can see that explaining 95% of the variance requires approximately 200 principal components. We felt that 95% of the variance was enough to accurately represent the dataset in far fewer features.
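The component count for a given variance target can also be read off the cumulative ratio directly. Below is a minimal sketch on synthetic stand-in data (the random `X_demo` matrix is our assumption, not the actual image matrix; the 95% target matches our choice above):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 50))  # stand-in for the flattened image matrix

pca_demo = PCA().fit(X_demo)                           # keep all components
cum_var = np.cumsum(pca_demo.explained_variance_ratio_)
n_for_95 = int(np.argmax(cum_var >= 0.95)) + 1         # first index crossing 95%
print(f"components needed for 95% variance: {n_for_95}")
```

On our real image matrix the same two lines applied to the fitted `pca` object reproduce the ~200-component figure quoted above.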
plot_gallery(X_pca_features, h, w)
Although these images do not seem to resemble any of our weather images, these represent the most common features in our different kinds of images.
# now run randomized PCA (RPCA) over the same features, again keeping 200 components
n_components = 200
h = 200
w = 150
print("Extracting the top %d eigenweathers from %d weather images" % (
    n_components, len(X)))
rpca = PCA(n_components=n_components, svd_solver='randomized')
%time rpca.fit(X.copy())
rpca_features = rpca.components_.reshape((n_components, h, w))
In terms of performance, both PCA and RPCA take between 5 and 7 seconds to fit the data from the PCA analysis.
plot_explained_variance(rpca)
The RPCA analysis graph above indicates that in order to explain ~95% of the variance of our dataset, RPCA requires nearly 200 principal components. This is nearly identical to the PCA analysis done previously. From the two analyses we can conclude that PCA and RPCA require a similar number of components to perform equally well on our dataset.
plot_gallery(rpca_features, h, w)
The gallery above shows the most common eigenweather values from RPCA. These images are nearly identical to the PCA eigenweather values, when placed side by side.
import copy
# transforming features
pca_features = pca.transform(copy.deepcopy(X))
rpca_features = rpca.transform(copy.deepcopy(X))
from sklearn.model_selection import train_test_split
pca_train, pca_test, rpca_train, rpca_test, y_train, y_test = train_test_split(
    pca_features, rpca_features, y, test_size=0.2, train_size=0.8)
# quantitative measure of performance using K Nearest Neighbors Classifier test
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
knn_pca = KNeighborsClassifier(n_neighbors=1)
knn_rpca = KNeighborsClassifier(n_neighbors=1)
knn_pca.fit(pca_train,y_train)
acc_pca = accuracy_score(knn_pca.predict(pca_test),y_test)
knn_rpca.fit(rpca_train,y_train)
acc_rpca = accuracy_score(knn_rpca.predict(rpca_test),y_test)
print(f"PCA accuracy: {100*acc_pca:.2f}%, RPCA accuracy: {100*acc_rpca:.2f}%")
Once we transformed our data from both PCA and RPCA, we were able to split our dataset for test and training data. We used our split data to conduct a KNN classification, where we fit the models to the respective training data and calculated the accuracy of each model.
Over numerous iterations we found that PCA and RPCA perform extremely similarly when using the same number of components. On occasions when we ran RPCA with fewer components than PCA (150 components), it still performed at a similar level of accuracy. However, in a few tests it performed nearly 3 to 4 percent worse than PCA, so rather than sacrificing model accuracy for speed we kept both PCA and RPCA at 200 components.
As listed in our business understanding, we felt that 80% accuracy on our classification model would indicate that we have created a useful model. As stated in the paragraph above, keeping our principal components at ~200 consistently gave us results of ~80% accuracy.
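The "numerous iterations" above amount to repeating the random train/test split and averaging. A minimal sketch of that loop on synthetic stand-in features (the 4-class `X_demo`/`y_demo` arrays are our assumptions, not our PCA features):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(2)
# 4 well-separated synthetic classes, 100 samples each, 20 features
X_demo = rng.normal(size=(400, 20)) + np.repeat(np.arange(4), 100)[:, None]
y_demo = np.repeat(np.arange(4), 100)

accs = []
for seed in range(10):
    Xtr, Xte, ytr, yte = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=1).fit(Xtr, ytr)
    accs.append(accuracy_score(yte, knn.predict(Xte)))
print(f"mean accuracy over 10 splits: {np.mean(accs):.3f}")
```

Averaging over many random splits gives a more stable accuracy estimate than any single 80/20 split.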
Although it is faster to get eigenvectors for a lower rank covariance matrix with RPCA, the time difference is not significant enough on our tests to say RPCA is much faster.
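To make that timing comparison concrete, here is a sketch comparing the two solvers on synthetic data (the 500x2000 matrix and component count are our stand-ins, not the actual image matrix, so absolute times will differ from the 5 to 7 seconds quoted above):

```python
import time
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(500, 2000))

t0 = time.perf_counter()
PCA(n_components=50, svd_solver='full').fit(X_demo)        # exact SVD
t_full = time.perf_counter() - t0

t0 = time.perf_counter()
PCA(n_components=50, svd_solver='randomized',
    random_state=0).fit(X_demo)                            # randomized SVD
t_rand = time.perf_counter() - t0
print(f"full: {t_full:.3f}s, randomized: {t_rand:.3f}s")
```

The randomized solver's advantage grows with matrix size, which is consistent with the small gap we observed on our ~1100x30000 matrix.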
def comp_reconstructed(resized_images, images, idx, setting, isRandom):
    X_idx = images[idx]
    low_dimensional_pca, reconstructed_image_pca = reconstruct_image(pca, X_idx.reshape(1, -1))
    low_dimensional_rpca, reconstructed_image_rpca = reconstruct_image(rpca, X_idx.reshape(1, -1))
    w = 150
    h = 200
    plt.figure(figsize=(15, 7))
    plt.subplot(1, 3, 1)
    im = resized_images[idx].resize((w, h))
    plt.imshow(im)
    plt.title(setting)
    plt.grid(False)
    plt.subplot(1, 3, 2)
    plt.imshow(reconstructed_image_pca.reshape((h, w)))
    plt.title(f'{setting} Reconstructed from Full PCA')
    plt.grid(False)
    plt.subplot(1, 3, 3)
    plt.imshow(reconstructed_image_rpca.reshape((h, w)))
    plt.title(f'{setting} Reconstructed from RPCA')
    plt.grid(False)

def reconstruct_image(trans_obj, org_features):
    low_rep = trans_obj.transform(org_features)
    rec_image = trans_obj.inverse_transform(low_rep)
    return low_rep, rec_image
for resized, images, setting in zip(resizedImages, np_image, types):
    for i in range(1):
        comp_reconstructed(resized, images, i, setting, False)
These images are indications that PCA and RPCA reconstruct very similarly. For reference, the original image is shown on the left.
from skimage.feature import daisy
from skimage.io import imshow
idx_to_reconstruct = 10
img = X[idx_to_reconstruct].reshape((h,w))
# lets first visualize what the daisy descriptor looks like
features, img_desc = daisy(img,
                           step=40,
                           radius=30,
                           rings=3,
                           histograms=8,
                           orientations=4,
                           visualize=True)
imshow(img_desc)
plt.grid(False)
# example of applying DAISY to an image
# create a function to take in a row of the matrix and return a new feature vector
# adjusted DAISY parameters because smaller steps and radii add computation without increasing our accuracy
def apply_daisy(row, shape):
    feat = daisy(row.reshape(shape), step=40, radius=30,
                 rings=3, histograms=8, orientations=4,
                 visualize=False)
    return feat.reshape((-1))
%time test_feature = apply_daisy(X[3],(h,w))
test_feature.shape
# apply to entire data, row by row,
# takes about a minute to run
%time daisy_features = np.apply_along_axis(apply_daisy, 1, X, (h,w))
print(daisy_features.shape)
from sklearn.metrics.pairwise import pairwise_distances
# find the pairwise distance between all the different image features
%time dist_matrix = pairwise_distances(daisy_features)
import copy
# find closest image to current image
idx1 = 200
distances = copy.deepcopy(dist_matrix[idx1,:])
distances[idx1] = np.infty # dont pick the same image!
idx2 = np.argmin(distances)
plt.figure(figsize=(7,10))
plt.subplot(1,2,1)
plt.imshow(X[idx1].reshape((h,w)))
plt.title("Original Image")
plt.grid()
plt.subplot(1,2,2)
plt.imshow(X[idx2].reshape((h,w)))
plt.title("Closest Image")
plt.grid()
The plot above shows the original image and the closest image that DAISY found to it. This relationship shows the similarities in features and helps us better understand how DAISY works.
# transform the models using the X data
rpca_features = rpca.transform(copy.deepcopy(X))
pca_features = pca.transform(copy.deepcopy(X))
# daisy features created above
# running K Nearest Neighbors algorithm again to compare against Daisy feature extraction
knn_rpca = KNeighborsClassifier(n_neighbors=1)
knn_pca = KNeighborsClassifier(n_neighbors=1)
knn_dsy = KNeighborsClassifier(n_neighbors=1)
rpca_train, rpca_test, pca_train, pca_test, dsy_train, dsy_test, y_train, y_test = train_test_split(
    rpca_features, pca_features, daisy_features, y, test_size=0.2, train_size=0.8)
knn_rpca.fit(rpca_train,y_train)
y_pred_rpca = knn_rpca.predict(rpca_test)
acc_rpca = accuracy_score(y_pred_rpca,y_test)
knn_pca.fit(pca_train,y_train)
y_pred_pca = knn_pca.predict(pca_test)
acc_pca = accuracy_score(y_pred_pca, y_test)
knn_dsy.fit(dsy_train,y_train)
y_pred_dsy = knn_dsy.predict(dsy_test)
acc_dsy = accuracy_score(y_pred_dsy,y_test)
print(f"PCA accuracy: {100*acc_pca:.2f}%, RPCA accuracy: {100*acc_rpca:.2f}%, DAISY accuracy: {100*acc_dsy:.2f}%")
The accuracies above indicate that DAISY feature extraction is a worse classifier than PCA and RPCA for our dataset. In this case, DAISY performs nearly 17% worse than PCA and RPCA. Therefore, DAISY does not show promise for our dataset given the KNN test run on each feature set.
In terms of performance, DAISY feature extraction takes significantly longer on our dataset than PCA and RPCA, without the added benefit of a better classifier. From these tests alone we can see that although DAISY is on the right track with the weather classification dataset, PCA and RPCA perform much better consistently.
We will look further into how each of these classifiers performs using heat maps.
from sklearn.metrics import confusion_matrix
cm_rpca = confusion_matrix(y_test, y_pred_rpca, labels=range(4))
cm_pca = confusion_matrix(y_test, y_pred_pca, labels=range(4))
cm_dsy = confusion_matrix(y_test, y_pred_dsy, labels=range(4))
print(cm_pca)
print(cm_rpca)
print(cm_dsy)
These confusion matrices show exactly how each image class was classified, as a 2D matrix of true versus predicted labels. See the heat maps below for more detail.
import seaborn as sns
plt.figure(figsize=(15,5))
axes = ['sunrise', 'sunshine', 'rain', 'cloudy']
x_points = np.array([0, 1, 2, 3])
plt.subplot(1,3,1)
sns.heatmap(cm_rpca, linewidth=.5)
plt.title('RPCA')
plt.xticks(x_points, axes)
plt.yticks(x_points, axes)
plt.grid(False)
plt.subplot(1,3,2)
sns.heatmap(cm_pca, linewidth=.5)
plt.title('PCA')
plt.xticks(x_points, axes)
plt.yticks(x_points, axes)
plt.grid(False)
plt.subplot(1,3,3)
sns.heatmap(cm_dsy, linewidth=.5)
plt.title('DAISY')
plt.xticks(x_points, axes)
plt.yticks(x_points, axes)
plt.grid(False)
plt.show()
The heatmaps above depict how each classifier predicts each kind of image data and to what degree of accuracy. PCA and RPCA have identical heatmaps with the same number of principal components.
As seen in the heat maps, both PCA and RPCA perform extremely well on the sunrise class. Sunrise was also DAISY's most accurate prediction class, though still less accurate than for PCA and RPCA. After sunrise, sunshine was predicted with the next highest accuracy by all 3 classifiers. Arguably the most important class for our business understanding, rain, was the most difficult for all classifiers to predict correctly. In multiple cases rain images were classified as cloudy, giving PCA and RPCA a rain accuracy of around 65%. This is below our goal, and tells us that we need to better predict rain images and find distinctions between images that are rain but predicted to be cloudy. Finally, our cloudy images were occasionally misclassified as each of the other weather types, but primarily as rain.
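Per-class accuracies like the ~65% rain figure fall out of the confusion matrix directly: each diagonal entry divided by its row sum is that class's recall. A sketch with a hypothetical matrix shaped like ours (the numbers below are illustrative, not our actual results):

```python
import numpy as np

# Hypothetical confusion matrix (rows = true class, columns = predicted class)
cm = np.array([[70,  2,  0,  1],
               [ 3, 60,  1,  4],
               [ 0,  2, 33, 15],   # rain often predicted as cloudy
               [ 1,  3,  8, 52]])

per_class_acc = cm.diagonal() / cm.sum(axis=1)  # recall per class
for name, acc in zip(['sunrise', 'sunshine', 'rain', 'cloudy'], per_class_acc):
    print(f"{name}: {100*acc:.1f}%")
```

Running the same two lines on `cm_pca`, `cm_rpca`, and `cm_dsy` gives the per-class numbers discussed above.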
DAISY's heat map clearly indicates that it has a poor classifier for nearly all types of images except the sunrise class, as seen by the many purple squares where classes were incorrectly classified.
Overall, PCA and RPCA have remarkably similar accuracies and ability to predict each class of data. In the future with this dataset we would choose PCA or RPCA over DAISY.
For our exceptional work, we compared different images from our data set using key point matching. It was a brute force matching using orb descriptors.
import numpy as np
import cv2 as cv
from matplotlib import pyplot as plt
from skimage.feature import match_descriptors
def compareKeyPoints(idx1, idx2, idx3):
    # Initiate ORB detector
    orb = cv.ORB_create()

    img1 = np.array(allImages[idx1])
    # find the keypoints, then compute the descriptors with ORB
    kp = orb.detect(img1, None)
    kp1, des1 = orb.compute(img1, kp)
    # draw only keypoint locations, not size and orientation
    imgkp = cv.drawKeypoints(img1, kp1, None, color=(0, 255, 0), flags=0)
    plt.figure(figsize=(15, 5))
    plt.subplot(1, 3, 1)
    plt.title('Image 1')
    plt.imshow(imgkp)

    img2 = np.array(allImages[idx2])
    kp = orb.detect(img2, None)
    kp2, des2 = orb.compute(img2, kp)
    imgkp = cv.drawKeypoints(img2, kp2, None, color=(0, 255, 0), flags=0)
    plt.subplot(1, 3, 2)
    plt.title('Image 2')
    plt.imshow(imgkp)

    img3 = np.array(allImages[idx3])
    kp = orb.detect(img3, None)
    kp3, des3 = orb.compute(img3, kp)
    imgkp = cv.drawKeypoints(img3, kp3, None, color=(0, 255, 0), flags=0)
    plt.subplot(1, 3, 3)
    plt.title('Image 3')
    plt.imshow(imgkp)
    plt.show()

    # count the key points that matched closely enough between each pair of images
    matches = match_descriptors(des1, des2, cross_check=True, max_ratio=0.9)
    print(f"Number of matches between Image 1 and 2, same class: {matches.shape[0]}, Percentage: {100*matches.shape[0]/len(des1):0.2f}%")
    matches = match_descriptors(des1, des3, cross_check=True, max_ratio=0.9)
    print(f"Number of matches between Image 1 and 3, diff class: {matches.shape[0]}, Percentage: {100*matches.shape[0]/len(des1):0.2f}%")
    matches = match_descriptors(des2, des3, cross_check=True, max_ratio=0.9)
    print(f"Number of matches between Image 2 and 3, diff class: {matches.shape[0]}, Percentage: {100*matches.shape[0]/len(des2):0.2f}%")
In order to write this function to perform key point matching between three images, we used these websites: https://docs.opencv.org/master/d1/d89/tutorial_py_orb.html https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_feature2d/py_matcher/py_matcher.html
In running trials of key point analysis, we noticed that there were not many matches between two images that were being compared, even if they were from the same class. With all the images only having a few matches at most, it was difficult to analyze how the key point matching compared between images in the various classes. To be able to differentiate comparisons of images from the same vs. different classes, we decided to increase the max_ratio to 0.9 from its standard value of 0.8. Increasing the maximum ratio of distances between the closest descriptors allowed our key point comparisons to find more matches in all comparisons. It especially found more matches in comparisons between images in the same class. Having this increase made the difference in key point matches between the various classes much more visible in our experimentation of running this function.
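The effect of loosening `max_ratio` can be checked in isolation: a larger ratio threshold always keeps a superset of the matches kept by a stricter one. A sketch with random stand-in descriptors (the two 50x32 `uint8` arrays are our assumptions, not real ORB output):

```python
import numpy as np
from skimage.feature import match_descriptors

rng = np.random.default_rng(3)
# stand-ins for ORB descriptors: 50 keypoints x 32-byte descriptors per image
des_a = rng.integers(0, 256, size=(50, 32), dtype=np.uint8)
des_b = rng.integers(0, 256, size=(50, 32), dtype=np.uint8)

loose = match_descriptors(des_a, des_b, cross_check=True, max_ratio=0.9)
strict = match_descriptors(des_a, des_b, cross_check=True, max_ratio=0.8)
print(f"max_ratio=0.9: {len(loose)} matches, max_ratio=0.8: {len(strict)} matches")
```

Because the ratio test only discards matches, raising `max_ratio` from 0.8 to 0.9 can only add matches, which is why our same-class comparisons became easier to distinguish.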
compareKeyPoints(618,621,1118)
This graphic compares two images of rain, Images 1 and 2, to a cloudy image, Image 3. Between Images 1 and 2 there are 9 shared key points, amounting to 13.04% similar points. When comparing Image 3 to Images 1 and 2, the shared key points and percentages were far lower. This is important because it shows that key point analysis is more likely to find similarities between rainy images and less likely to find many similarities between cloudy and rainy images.
compareKeyPoints(230,240,369)
Our RPCA and PCA analysis showed that we were able to differentiate between sunrise and sunshine images with a high level of accuracy. It's interesting to note that key point analysis finds more similarities between these two types of images than expected, but it still finds a higher similarity percentage between the two sunrise images.
compareKeyPoints(370,439,659)
We also compared two sunny images, Image 1 and 2, to a rainy image, Image 3. Our RPCA and PCA classification didn't have issues distinguishing between these two types of images so it is good to see that neither did our key point analysis. It is worth noting that Image 2 and Image 3 shared 4 key points, more than Image 1 and 2. However, since Image 1 and Image 2 have less total key points between the two of them, the percentage similarity between them was higher.
Overall, key point analysis assists in finding more similarities between images and visually shows where those similarities lie. It helps us dive deeper into the classification of images and potential misclassifications. Generally, PCA and RPCA are great distinguishers for our weather classification, but key point analysis can surface additional similarities between images.